Welcome to the final Reinforcement Learning Notebook. In this part you will implement the Deep Q-Learning algorithm that was used by Mnih et al. to play Atari video games. The resulting agent is called a Deep Q-Network agent (or DQN agent for short) because it uses a deep neural network to approximate the value function (instead of saving it in a table).
Write your code in Code-Cells (Cell->Cell Type->Code) and all text as Markdown in Markdown-Cells. E-mail your complete notebook to maucher@hdm-stuttgart.de by the start of the next lecture. One notebook per group is enough. Edit the team member table below.
Important: Also attach a HTML version of your notebook (File->Download as->HTML) in addition to the .ipynb-File.
| Teammember | |
|---|---|
| 1. | Christopher Caldwell |
| 2. | Fabian Müller |
| 3. | An Dang |
Some good reading resources are:
In the last notebook you implemented the model-free Q-Learning algorithm and solved the full reinforcement learning problem by learning from samples. In this context, full referred to the fact that we don't have access to the world model, and model-free to the fact that we have not tried to learn that model. Furthermore, Q-Learning performed online updates to the policy, meaning that we adjusted the policy after every time step. Finally, Q-Learning is an off-policy algorithm because we followed an e-greedy behavior policy while we performed updates according to a greedy target policy. Now we will tackle the curse of dimensionality by approximating the value function instead of saving it explicitly in a table.

Before we proceed to the solution, let us quickly revise the actual problem that we are trying to solve. Consider the task of learning to play a video game given only the raw game screen as input. This is similar to how humans would play the game. Since the game screen is typically represented as raw pixels, this leaves us with a really high dimensional input or state space, because every change of pixels represents a new and distinct state of the game, even if the change seems completely insignificant to you! Remember, the agent has no real knowledge of the game (or world model). Clearly it is infeasible to store every possible pixel combination in a table. See Reinforcement Learning: An Introduction chapter 16.5 for a comprehensive discussion.

The problem that is solved by Deep Reinforcement Learning (in the case of DQN) is how to learn a mapping from a high dimensional input space to action values. This mapping represents the value function and can be used in a policy, e.g. to choose the best action with the highest value.
First of all, a lookup table can mathematically be seen as a very simple form of a function, i.e. a direct mapping of values (hence the name value function). However, for the reasons explained above, this approach does not scale to high dimensional input spaces. A typical solution to this problem is to replace the perfect but intractable lookup table with a more complicated function that only approximates the true value function but is computationally tractable. In the case of DQN we choose a deep neural network as our function approximator. Formally, this new function is denoted as $\hat{Q}$ and we write
$$\begin{eqnarray} \hat{Q}(s,a,\theta) \approx Q_{\pi}(s,a) \end{eqnarray}$$where $\theta$ are the parameters of the neural network. In other words, the value function now depends on those parameters and the task of finding an optimal value function turns into the task of finding an optimal set of parameters for the network. Fortunately, we know how to train and optimize a neural network with SGD and backpropagation given an appropriate loss function. Inside the RL framework we can use the TD-error as the loss function. Formally we optimize:
$$\begin{eqnarray} L_i(\theta_{i}) = \Big( \underbrace{r_{t+1} + \gamma \max_a Q(s_{t+1}, a; \theta_i) - Q(s_t, a_t;\theta_i)}_{\text{TD-error}} \Big)^2 \end{eqnarray}$$Note that in order to obtain any action-values, we now need to perform a forward pass through the network. In practice, this means two forward passes before we can calculate the loss: one pass for the value of $Q(s_t, a_t;\theta_i)$ and another one to calculate the value of $\max_a Q(s_{t+1}, a; \theta_i)$. More details on that later.
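To make the two forward passes concrete, here is a minimal NumPy sketch; `q_values` is a hypothetical stand-in for the network's forward pass and all numbers are made up:

```python
import numpy as np

def q_values(state, theta):
    # Hypothetical linear "network": returns one value per action, parameters theta
    return state @ theta

gamma = 0.99
theta = np.random.randn(4, 2)                    # toy parameters: 4 state features, 2 actions
s_t, a_t = np.random.randn(4), 0
r_t1, s_t1 = 1.0, np.random.randn(4)

q_sa = q_values(s_t, theta)[a_t]                          # 1st forward pass: Q(s_t, a_t; theta)
td_target = r_t1 + gamma * np.max(q_values(s_t1, theta))  # 2nd forward pass: max_a Q(s_{t+1}, a; theta)
loss = (td_target - q_sa) ** 2                            # squared TD-error used as the loss
```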
So far, so good. By using the TD-error as the loss function we can train the network in a supervised-learning-like setup. Sadly it is not that easy. Remember that in supervised learning we assumed the data to be independent and identically distributed (iid data) in order for SGD to work properly. This assumption does not hold in reinforcement learning, where subsequent data is highly correlated and, in contrast, depends strongly on the agent's last choice of actions. This inherent sequential property, in combination with an off-policy algorithm and a non-linear function approximator such as a neural network, results in the problem that the learnable network parameters are at risk of oscillating or even diverging catastrophically during training. In theory, there is no convergence guarantee whatsoever. In practice, Mnih et al. found two major ways in which the training process can be stabilized:
Experience Replay - This idea introduces a so called replay buffer $\mathcal{D}$ which stores the last $N$ state transitions as experience tuples $(S,A,R,S')$. In other words, the agent saves its recent history to a buffer. This way, experience can be reused and the correlation between samples can be broken by drawing random minibatches of experience $U(\mathcal{D})$ from it during the training.
Fixed Q-Targets - The second idea is to keep a separate set of parameters $\theta^{−}$ for calculating the TD target. This set is basically a copy of $\theta$ that is held fixed for some number of time steps $C$ and periodically gets swapped with the current parameter values in order to allow progress. Mnih et al. have shown that updating $\theta$ towards such fixed Q-targets is another effective way to stabilize the training process. In practice, this means that we basically have two separate networks, which we will distinguish by their different sets of parameters $\theta^{-}$ and $\theta$. We will refer to them respectively as the Target-Network and the Q-Network.
As a result, the Q-learning update of DQN at iteration $i$ uses the following loss function:
$$\begin{eqnarray} L_i(\theta_{i}) = \mathbb{E}_{(s,a,r,s') \sim U(\mathcal{D})} \Bigg[ \Big( r_{t+1} + \gamma \max_a \underbrace{Q(s_{t+1}, a; \theta^{-}_{i})}_{\text{Target-Network}} - \underbrace{Q(s_t, a_t;\theta_i)}_{\text{Q-Network}} \Big)^2 \Bigg] \end{eqnarray}$$And that's it! We can use this update rule inside the Q-Learning algorithm to train a Deep Q-Network with SGD as we do in supervised learning. The corresponding Deep Q-Learning algorithm is given in the next part.
As in the previous notebooks we will implement the DQN algorithm step by step. While the original DQN architecture was a CNN trained on Atari games, we will choose a much simpler problem and architecture. This way you can verify and debug your implementation much faster (in minutes vs hours...). However, the algorithm itself is still the same and extending it should be straightforward after completing the notebook. Though, this is left to the further ideas part depending on your time and motivation.
The following is an overview of all the parts you need. Use it as a checklist if you get lost. Like with Q-Learning, try first to verify that all the sub-parts are working as expected. If you are confident, integrate them iteratively into the main loop. There is no single best approach how to proceed so feel free to jump back and forth between the cells as you like.
You will use the OpenAI Gym environment to solve a classic control task known as Cart Pole Balance. The great thing about the gym environment is that it offers a common interface to many different environments. That way you can easily test your algorithms on different tasks, e.g. switch from an easy one like CartPole to more challenging ones like an Atari game etc. ;)
For now, we will solve the CartPole-v0 task.
gym.make('CartPole-v0') returns a new game. The most important parts of the interface are:
- action_space and observation_space variables
- reset() - returns an initial observation
- step() - takes an action int, returns an observation, reward, game_over, info tuple
- render() - renders the current game state
- close() - call this after the last episode has ended to clean up

import gym
import numpy as np
game = gym.make("CartPole-v0")
observation_space = game.observation_space.shape[0]
action_space = game.action_space.n
# Play a few episodes with random actions
for i in range(4):
    state = game.reset()
    state = np.reshape(state, [1, observation_space])
    print(state)
    for x in range(200):
        state_next, reward, terminal, info = game.step(np.random.randint(0, action_space))
        #DEBUG print("State: ", state_next)
        #DEBUG print("Reward: ", reward)
        #DEBUG print("Terminate? ", terminal)
        #DEBUG print("Info: ", info)
        if terminal:
            #DEBUG print("END OF GAME!!!!!!!")
            break
        #game.render()
game.close()
The replay buffer should store the last $N$ experience tuples. This is basically a FIFO queue and practically, python offers such a data structure called deque. If initialized with a maxlen parameter, deque's append method will pop items from the left automatically when the list grows beyond the given maxlen. This is exactly what we want and you can implement it in just a few lines of code! The replay buffer should have the following methods:
- __init__ constructor, initializing an internal deque with a given maxlen, $N$, or better, call it buffer_size.
- add method, appending a new [state, action, reward, next_state, done] tuple (done is the game_over information).
- sample method, sampling a random batch of training data of size batch_size. You can use random's sample method for that.

Use the cells below to test your implementation, e.g. by filling it with some integers in a loop, check what's in the queue, test the sampling method etc.
import random as r
from collections import deque
class ReplayBuffer():
    def __init__(self, buffer_size):
        self.buffer = deque([], maxlen=buffer_size)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append([state, action, reward, next_state, done])
        #print(self.buffer)

    def sample(self, sample_size):
        return r.sample(self.buffer, sample_size)

# TESTING the replay buffer
size = 3
replay_buffer = ReplayBuffer(size)
for i in range(size + 5):
    replay_buffer.add(1, 2, 3, 4, i)
replay_buffer.sample(3)
Last time, we calculated the current epsilon value inside the main loop. This time we need a little bit more control so let's create a class for that task. The reason for that is that we have to pre fill the replay buffer with some initial random experience before we can sample from it and start with the actual training. We want to control the amount of initial experience with a pre_train_steps variable. During this time, the schedule should return the start_epsilon value so that the agent behaves fully random. After that, the normal decay should be applied. The implementation needs two methods:
- __init__ constructor, takes all hyperparameters for the schedule such as start_epsilon, final_epsilon, pre_train_steps, final_exploration_step; pre-calculate the decay value per step here.
- value method, takes a time step t and returns a corresponding epsilon value. If t is smaller than pre_train_steps or greater than final_exploration_step, return the fixed values accordingly. In between, calculate the decayed epsilon value at time t.

Use the code in the cell below to test and visualize your schedule.
import numpy as np
class LinearSchedule():
    def __init__(self, start_epsilon, final_epsilon, pre_train_steps, final_exploration_step):
        self.start_epsilon = start_epsilon
        self.final_epsilon = final_epsilon
        self.pre_train_steps = pre_train_steps
        self.final_exploration_step = final_exploration_step
        self.decay = (self.start_epsilon - self.final_epsilon) / (self.final_exploration_step - self.pre_train_steps)

    def value(self, t):
        if t <= self.pre_train_steps:
            return self.start_epsilon
        if t > self.final_exploration_step:
            return self.final_epsilon
        return self.start_epsilon - self.decay * (t - self.pre_train_steps)
# TESTING the epsilon schedule
%matplotlib inline
import matplotlib.pyplot as plt
schedule = LinearSchedule(1.0, 0.1, 100, 1000)
test_points = [schedule.value(t) for t in range(1100)]
plt.plot(test_points)
The original DQN agent included a CNN as shown in the theory part of this notebook. For our task however, a simple MLP with only one hidden layer should be enough. Starting that simple will help you to get the other implementation details right. Later on you can easily scale up and swap the MLP for a more powerful network.
While the architecture will be easy, the DQN algorithm requires us to keep basically two separate networks, namely a main Q-Network and a second Target-Network. In addition, TensorFlow requires us to keep a reference to every node we want to calculate with the sess.run command. For those reasons we will build a generic and reusable DQNetwork class and bind all graph nodes to the object. Later we can then simply use the instance objects to reference specific nodes in a clean and readable way.
We will implement the DQNetwork class below in two steps. The first part, denoted as basic Deep Q-Network, includes the actual network. The second part, denoted as Q-Learning Calculations, includes all the additional calculations to train the network.
This part is identical for the main Q- and the Target-Network. It includes all trainable variables of the graph! The model should be a fully connected feedforward network with one hidden layer of size $64$ and ReLU activations. For the CartPole task, the resulting MLP will consequently be of size num_inputs=4, num_hidden=64, num_outputs=2. For the generic DQNetwork class however, let the user specify those values as parameters.
Now create a self.best_action node which should take the output layer (or q-values) and return the index of the maximum action value. You can use tf.argmax for that. You can query this node in the e-greedy action selection later.
Create another node, self.max_q which does the same thing but returns the value of the maximum action value. You can use tf.reduce_max for that. You will need this node for the calculation of the TD-target later.
So far, everything should have been very straightforward.
This part is different for the Q- and the Target-Network. It includes all the necessary calculations for training the network. The code skeleton below shows how to constrain the graph creation with two simple if statements.
First, the Target Stream. This part will later calculate the TD-Target (or $y_i$) with the following equation:
$$\begin{eqnarray} y_i = r_{t+1} + \gamma \max_a \underbrace{Q(s_{t+1}, a; \theta^{-}_{i})}_{\text{Target-Network}} \end{eqnarray}$$Remember that we will train with mini-batches sampled from the replay buffer. For that reason, the reward values will be provided by mini-batches. This means we have to feed them externally, so let's create a placeholder node for those. Next, gamma will be a constant, so let's create one in TensorFlow and let the user specify it as a parameter at creation time of the network. Next, $s_{t+1}$ are the next_states from the buffer. This is not important here but you have to feed them correctly later! To get the maximum action value, use the self.max_q node that you have created earlier. Finally, there is a small detail we have not talked about yet. In the rare case that the next state is a final state, i.e. the game is over, only the reward should be taken into account. To realize this, here is a little trick: create another placeholder for the boolean done values from the mini-batch. Then, multiply the right-hand side of the equation with tf.abs(self.done - 1), which is $0$ when the episode is done and $1$ otherwise. For clarity, call the final node which implements the equation above self.td_target!
Note that this is a lot of text but basically 4 lines of code!
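If you want to convince yourself that the done-masking works, here is a tiny NumPy example with made-up numbers:

```python
import numpy as np

gamma = 0.99
rewards    = np.array([1.0, 1.0, 1.0])
max_q_next = np.array([10.0, 12.0, 8.0])
done       = np.array([0, 0, 1])        # the last transition ended the episode

td_target = rewards + gamma * max_q_next * np.abs(done - 1)
print(td_target)   # -> [10.9  12.88  1.  ]  only the reward survives where done == 1
```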
Second, the Q Stream. This part will calculate the full TD-Error and optimize the mean squared error on the mini-batch.
$$\begin{eqnarray} L_i(\theta_{i}) = \mathbb{E}_{(s,a,r,s') \sim U(\mathcal{D})} \Bigg[ \Big( y_i - \underbrace{Q(s_t, a_t;\theta_i)}_{\text{Q-Network}} \Big)^2 \Bigg] \end{eqnarray}$$Now first, create a placeholder for the TD-target ($y_i$). These values have been calculated by the Target-Network and we must feed them again to the main Q-Network. Yes, it's really two separate networks even though they share some code here. Next we need the value of $Q(s_t, a_t;\theta_i)$. We can query all action values by feeding the mini-batch of states ($s_t$) from the buffer to the network (the first part of the network). The problem is that we only want to select the action value of the action that was actually taken. Luckily we have this information as part of the mini-batch. We can feed it with an additional placeholder, let's call it self.actions. The idea is now to mask the output of the network with a one-hot encoded representation of the action indices from the placeholder. To do this use tf.one_hot and call the resulting node self.actions_onehot or similar. Finally multiply the one-hot vector with the output of the network and call tf.reduce_sum on the result to obtain a clean list of the action values we want. The resulting node, let's call it self.Q, holds the values of $Q(s_t, a_t;\theta_i)$ for the complete mini-batch.
If you feel uncomfortable with this trick, create a toy example in a separate cell in order to understand what's going on here.
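For example, a small NumPy toy version of the masking trick (with made-up numbers) could look like the following; the TensorFlow version uses tf.one_hot and tf.reduce_sum in exactly the same way:

```python
import numpy as np

q_values = np.array([[1.0, 5.0],
                     [2.0, 0.5],
                     [3.0, 4.0]])      # network output for a mini-batch of 3 states, 2 actions
actions = np.array([1, 0, 1])          # actions that were actually taken

actions_onehot = np.eye(2)[actions]                     # one-hot mask, shape (3, 2)
selected_q = np.sum(q_values * actions_onehot, axis=1)  # -> [5.  2.  4.]
print(selected_q)
```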
Great, now the rest should be straightforward again. Square the difference of td_target and Q to obtain the TD-error and use tf.reduce_mean on that to obtain the expected loss of the mini-batch. This corresponds to $\mathbb{E}_{(s,a,r,s') \sim U(\mathcal{D})}$ in the formula. Create an optimizer node such as the AdamOptimizer and, again, let the user specify the learning rate as a parameter at creation time of the network. Finally, minimize the loss and call this final node something like self.updateModel or self.train.
Note that this is again a lot of text but only something like ~8 lines of code!
import tensorflow as tf
class DQNetwork():
    def __init__(self, scope, num_inputs, num_hidden, num_outputs, gamma, learning_rate):
        self.scope = scope
        with tf.variable_scope(self.scope):
            # ---------------------
            # Basic Deep Q-Network
            # ---------------------
            self.state = tf.placeholder(tf.float32, shape=[None, num_inputs], name="StatePlaceholder")
            self.hidden = tf.layers.dense(inputs=self.state, units=num_hidden, activation=tf.nn.relu, name="hiddenLayer")
            self.logits = tf.layers.dense(inputs=self.hidden, units=num_outputs, name="logits")
            self.best_action = tf.argmax(self.logits, axis=1, name="best_action")
            self.max_q = tf.reduce_max(self.logits)
            # ------------------------
            # Q-Learning Calculations
            # ------------------------
            if scope == 'Target':
                self.reward_placeholder = tf.placeholder(tf.float32, shape=None, name="RewardPlaceholder")
                self.gamma = tf.constant(value=gamma, name="GammaConstant")
                self.done_placeholder = tf.placeholder(tf.int32, shape=None, name="DonePlaceholder")
                # tf.abs(done - 1) is 0 for terminal transitions, so only the reward remains
                self.td_target = self.reward_placeholder + (gamma * self.max_q) * tf.cast(tf.abs(self.done_placeholder - 1), tf.float32)
            if scope == 'Q':
                self.td_target_placeholder = tf.placeholder(tf.float32, shape=None, name="TD-TargetPlaceholder")
                self.actions = tf.placeholder(tf.uint8, shape=None, name="ActionTakenPlaceholder")
                # Mask the network output so that only the value of the action actually taken remains
                self.actions_onehot = tf.one_hot(self.actions, depth=num_outputs)
                self.masked_logits = tf.multiply(self.actions_onehot, self.logits)
                self.Q = tf.reduce_sum(self.masked_logits)
                self.td_error = self.td_target_placeholder - self.Q
                self.expected_loss = tf.reduce_mean(self.td_error ** 2)
                self.optimizer = tf.train.AdamOptimizer(learning_rate, name="AdamOptimizer")
                self.update_model = self.optimizer.minimize(self.expected_loss, name="minimize_loss")
# A cell for testing
As in the Q-Learning notebook, let us encapsulate the action selection into a separate method. This time however, selecting a greedy max action requires us to perform a forward pass through the Q-Network. The method per se remains as simple as in the Q-Learning case. In order to perform the forward pass of the network though, we have to hand over a reference to the current Tensorflow session object from the main loop etc.
Note that the sess.run call will most certainly return a list containing a single action index. Make sure to unpack it properly before returning it. You can test this in a separate cell or in the main loop by using a fixed epsilon value.
import numpy as np
def choose_egreedy_action(session, current_state, network, epsilon):
    # Probabilities for choosing a random action (first value = exploration)
    # vs. a learned action (second value = exploitation)
    probability = [epsilon, 1 - epsilon]
    #DEBUG print("Probability: ", probability)
    random_method = [0, 1]
    # np.random.choice picks one of the two methods according to the probabilities
    action_choice = np.random.choice(random_method, p=probability)
    # Exploration choice | action_choice = 0 | a high epsilon value gives a high probability of exploring
    if action_choice == 0:
        # Choose a random action out of the two possibilities (action 0 and action 1)
        action = np.random.choice(2)
    # Exploitation choice | action_choice = 1 | a low epsilon value gives a high probability of exploiting
    else:
        # Reshape the current state to fit the network (the first dimension is the batch size)
        current_state = np.reshape(current_state, [1, observation_space])
        # Get an action by performing a forward pass through the network.
        # max_q and logits are also fetched for DEBUGGING reasons
        action, max_q, logits = session.run([network.best_action, network.max_q, network.logits],
                                            feed_dict={network.state: current_state})
        #DEBUG: print("Max_q: ", max_q)
        #DEBUG: print("Logits: ", logits)
        # The network returns a one-dimensional array containing the best predicted action, so unpack it
        action = action[0]
    #DEBUG: print(action_choice, " : ", action)
    return action
As explained in the theory part, the Target-Network will be fixed for some time $C$ while the main Q-Network gets updated every training/update step. Every $C$ time steps, however, we want to switch the networks, or better, update the Target-Network with the latest information from the Q-Network. This basically means that we want to copy over all weights from the Q-Network and assign them to the Target-Network. The Q-Network itself remains unchanged. We will control this freeze frequency later inside the main loop and execute the copy process only every $C$ time steps.
Assigning new values to variables in Tensorflow can be done with the tf.assign method. But, as everything in TensorFlow, these assign operations will be tensor nodes and we have to create them before we start the session. Since they don't belong to any of the networks let us do this in an extra method.
Note that in (before) the main loop you will have to create both networks first and then hand them to the get_update_target_ops method to obtain the list of assign operations!
- Get the trainable variables of each network with tf.trainable_variables(scope=...).
- Sort them with sorted and the attrgetter helper, e.g. Q_vars = sorted(Q_vars, key=attrgetter('name')).
- Create a new list, e.g. update_target_expr.
- For each pair of variables create an assign operation t_var.assign(q_var). Append them to the expression list.
- Later, call sess.run(update_target_expr) to run all of the assign operations.

Use the cells below to test your implementation with some toy networks!
# Zip demo
x = [1,2,3]
y = [4,5,6]
for x, y in zip(x, y):
    print(x, y)
from operator import attrgetter
def get_update_target_ops(Q_network, Target_network):
    # 1. Get the trainable variables per network
    q_train_vars = tf.trainable_variables(scope=Q_network.scope)
    target_train_vars = tf.trainable_variables(scope=Target_network.scope)
    # 2. Sort them with sorted(list, key=attrgetter())
    q_train_vars = sorted(q_train_vars, key=attrgetter("name"))
    target_train_vars = sorted(target_train_vars, key=attrgetter("name"))
    #print(q_train_vars)
    # 3. Create a new list to append all assign operations to
    update_target_expr = []
    for q_var, t_var in zip(q_train_vars, target_train_vars):
        # Create an individual assign operation for each variable pair
        assign_task = t_var.assign(q_var)
        # Append the assign operation to the update target expression list
        update_target_expr.append(assign_task)
    return update_target_expr
# TESTING get_update_target_ops with some toy networks
tf.reset_default_graph()
Q1 = DQNetwork("Q1", 1,2,1, 0.9, 0.1)
Q2 = DQNetwork("Q2", 1,2,1, 0.9, 0.1)
Q1_vars = tf.trainable_variables(scope="Q1")
Q2_vars = tf.trainable_variables(scope="Q2")
update_target_expr = get_update_target_ops(Q1, Q2)
print("List of created Assign operations"), print(update_target_expr)
print("\n Q1 Variables"), print(Q1_vars)
print("\n Q2 Variables"), print(Q2_vars)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print("\n Q1 Values"), print(sess.run(Q1_vars))
    print("\n Q2 Values"), print(sess.run(Q2_vars))
    sess.run(update_target_expr)
    print("\n Q2 Values AFTER network copy. Should now be identical to the values of Q1")
    print(sess.run(Q2_vars))
Finally let's create a method to run the actual network training. Doing this in an extra method is not necessary per se but un-clutters the main loop a lot. The implementation requires the following three steps:
- Sample a mini-batch from the replay buffer and unpack it into separate lists ([]!) of observations, actions, rewards, next_observations and done. Make sure this is true before creating corresponding feed_dicts! Depending on your replay buffer implementation you may find it handy to use zip again and apply the list operator with the map function, e.g. map(list, zip(*batch)). See the cell below for an UnZip demo.
- Create a feed_dict and get the td_target values by running the Target-Network.
- Create another feed_dict and run the update_model or train operation of the main Q-Network.

Now we can simply call train inside the main loop every time we want to train the Q-Network.
# UnZip demo *
mini_batch = [[1,2,3], [1,2,3], [1,2,3]]
for i in zip(*mini_batch):
    print(i)
def train(sess, Q, Target, buffer, batch_size):
    # 1. Sample from the replay buffer
    sample = buffer.sample(batch_size)
    separate_lists = list(map(list, zip(*sample)))
    # Save each element of the separated sample into its own list
    observations = separate_lists[0]
    actions = separate_lists[1]
    rewards = separate_lists[2]
    next_observations = separate_lists[3]
    done = separate_lists[4]
    # 2. Retrieve the TD-target values from the Target-Network
    # For each sample in the batch
    for i in range(batch_size):
        # Reshape to fit the network
        next_observations[i] = np.reshape(next_observations[i], [1, observation_space])
        # Define a feed dictionary for td_target:
        # feed next_observation, reward and done into the Target-Network
        td_target_feed = {Target.state: next_observations[i],
                          Target.reward_placeholder: rewards[i],
                          Target.done_placeholder: done[i]}
        td_target = sess.run(Target.td_target, feed_dict=td_target_feed)
        # 3. Perform a training step by running the main Q-Network
        # Reshape to fit the network
        observations[i] = np.reshape(observations[i], [1, observation_space])
        sess.run(Q.update_model, feed_dict={Q.state: observations[i],
                                            Q.td_target_placeholder: td_target,
                                            Q.actions: actions[i]})
Below you can see the Deep Q-Learning pseudo code from the original paper. This will help you to get at least the main parts of the algorithm right and should serve as a rough blue print of when to do what inside the loop. In practice, there are many more subtle details which are not mentioned explicitly. For that reason we provide you an additional checklist below the algorithm so you don't miss anything important. As in the last notebook we strongly recommend to start simple and work through the steps one by one! In addition, the checklist provides some sane default ranges for the hyperparameters to get you started. Try different settings and make it work :) If you have problems, better start with huge values and reduce them step by step as needed!
Hint: Personally I like to put the (hyperparameters and the preparation stuff) and the actual loop in separate code cells. You don't have to follow this convention. If you prefer one huge code cell, that's fine. Just do what works best for you!
Initialize target action-value function $\hat{Q}$ with random weights $\theta^{-}$

For $t = 1, T$ do

Set
$$ y_j = \begin{cases} r_j & \text{if episode terminates at step } j + 1 \\ r_j + \gamma \max_a \hat{Q}(s_{j+1}, a; \theta^{-}) & \text{otherwise} \end{cases} $$

Perform a gradient descent step on $\big(y_j - Q(s_j,a_j;\theta)\big)^2$ with respect to the network parameters $\theta$
This is a helpful checklist without any claim to completeness. Depending on your implementation you may add, change or remove parameters as you like!
Preparation and hyper parameters
Epsilon Schedule
- start_epsilon $1$
- final_epsilon $\in \{0.02,0.1\}$
- pre_training_steps $\sim [32,...,10000]$
- final_exploration_step $\sim [100,...,40000]$

Replay Buffer
- buffer_size $N \in \{32,100,500,1000,10000,50000, ...?\}$ (Bigger is better but try small ones too!)

Training
- T $\in \{10k,20k,30k,40k,100k\}$
- training_freq $1$ - train the Q-Network only every $n$ steps. For now just use 1 as default.
- switch_networks $C \sim 500$
- gamma $\in \{0.9, 0.99, 1\}$
- batch_size $32$
- learning_rate $0.001$

Model
- tf.reset_default_graph() before creating a new graph
- observation_space and action_space from the game env
- num_hidden $64$
- Q_network and T_network with scope "Q" and "Target"
- get_update_target_ops to something like update_target_network

Inside the Loop
- Start the actual training only once t becomes t > pre_training_steps
- Don't forget observation = new_observation for $t+1$

If the Loop runs without errors
Create insight
Save/Load the model - needed for the test evaluation later
- saver = tf.train.Saver() - outside the session
- saver.save(sess, "./some_path/model.ckpt") - with an active session
- saver.restore(sess, "./some_path/model.ckpt") - with a new session. In this case tf.global_variables_initializer() is not required. A fitting graph definition however is. If there is none left in RAM, e.g. if the kernel was restarted, make sure you recreate the graph definition before restoring (at least the main Q-Network). You will need it anyway to reference nodes in the sess.run calls later.

import gym
import tensorflow as tf
# Create a new game
game = gym.make('CartPole-v0')
# Your code comes here
tf.reset_default_graph()
# Hyperparameters:
## Epsilon Schedule
start_epsilon = 1.0
final_epsilon = [0.02 , 0.1]
pre_training_steps = [700,5000]
## Replay Buffer
buffer_size = [500,10000]
## Training
### Max Trainsteps T
T = [10000,35000]
### Currently use 1 as Train Frequency 1 = 100% of steps are being trained
train_freq = 1
### Update the Target Network every 500 steps
switch_networks = 500
### Set Gamma
gamma = [0.9, 0.99]
### Set Batch Size
batch_size = 32
### Set Learning Rate
learn_rate = 0.001
##Model
hidden_layers = 64
#Create different Hyperparameter filled lists
hyperparameter_list = []
for m_final_epsilon, m_pretrain_steps, m_buffersize, m_T, m_gamma in [
        (m_final_epsilon, m_pretrain_steps, m_buffersize, m_T, m_gamma)
        for m_final_epsilon in final_epsilon
        for m_pretrain_steps in pre_training_steps
        for m_buffersize in buffer_size
        for m_T in T
        for m_gamma in gamma]:
    current_list = [m_final_epsilon, m_pretrain_steps, m_buffersize, m_T, m_gamma]
    hyperparameter_list.append(current_list)
# Plotting Values
epsilon_plot = []
episode_length_plot = []
episode_reward_plot = []
model_name = []
#To see if the GPU is being used or not
tf.test.is_gpu_available(
cuda_only=False,
min_cuda_compute_capability=None
)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import itertools
def plot_model(epsilon_plot, episode_length_plot, episode_reward_plot, plot_name):
    sns.set_style("darkgrid")
    pallete = sns.color_palette()
    epsilon_percentage = []
    for i, epsi in enumerate(epsilon_plot):
        # Multiply with 100 - to show percentage
        epsilon_percentage.append(epsi * 100)
    mean_episode_length_plot = [np.mean(episode_length_plot) for i in range(len(episode_length_plot))]
    mean_episode_reward_plot = [np.mean(episode_reward_plot) for i in range(len(episode_length_plot))]
    fig = plt.figure(figsize=(25, 8))
    # Plot return per episode
    ax1 = fig.add_subplot(1, 2, 1)
    ax1.plot(episode_reward_plot, color=pallete[0])
    ax1.plot(mean_episode_reward_plot, color=pallete[1])
    ax1.plot(epsilon_percentage, color=pallete[2])
    ax1.legend(['Return', 'Mean', 'Epsilon'])
    ax1.set_title("{}".format(plot_name), fontsize=14)
    plt.ylabel("Return")
    plt.xlabel("Episode")
    # Saving images
    #plt.savefig('./img/{}.png'.format(plot_name))
    # Show the plot
    plt.tight_layout()
    plt.show()
from tqdm import tqdm
current_name = ""
for current_index, current_hyperlist in enumerate(hyperparameter_list):
    current_final_epsilon = current_hyperlist[0]
    current_pretrain_steps = current_hyperlist[1]
    current_buffersize = current_hyperlist[2]
    current_T = current_hyperlist[3]
    current_gamma = current_hyperlist[4]
    temp_c_e = start_epsilon
    final_exploration_step = [current_pretrain_steps * 2, current_pretrain_steps * 4]
    for exploration_index, final_expl_step in enumerate(final_exploration_step):
        current_epsilon_plot = []
        current_episode_length_plot = []
        current_episode_reward_plot = []
        current_final_exploration_step = final_expl_step
        current_name = "final-epsiolon_{0}|pretrain-steps_{1}|final-exploration-step_{2}|buffersize_{3}|T_{4}|gamma_{5}]".format(
            current_final_epsilon,
            current_pretrain_steps,
            current_final_exploration_step,
            current_buffersize,
            current_T,
            current_gamma)
        print_index = current_index * 2 + exploration_index + 1
        print_all = len(hyperparameter_list) * len(final_exploration_step)
        print(print_index, "/", print_all, ": ", current_name)
        tf.reset_default_graph()
        with tf.Session() as sess:
            current_epsilon_schedule = LinearSchedule(start_epsilon, current_final_epsilon,
                                                      current_pretrain_steps, current_final_exploration_step)
            replay = ReplayBuffer(current_buffersize)
            Q_network = DQNetwork("Q",
                                  game.observation_space.shape[0],
                                  hidden_layers,
                                  game.action_space.n,
                                  current_gamma,
                                  learn_rate)
            Target_network = DQNetwork("Target",
                                       game.observation_space.shape[0],
                                       hidden_layers,
                                       game.action_space.n,
                                       current_gamma,
                                       learn_rate)
            update_target_network = get_update_target_ops(Q_network, Target_network)
            sess.run(tf.global_variables_initializer())
            saver = tf.train.Saver(save_relative_paths=True)
            state = game.reset()
            state = np.reshape(state, [1, game.observation_space.shape[0]])
            episode_steps = 0
            reward_in_episode = 0
            for t in tqdm(range(current_T)):
                episode_steps += 1
                current_epsilon = current_epsilon_schedule.value(t)
                """
                DEBUGGING
                if temp_c_e < current_epsilon:
                    print("Old: ", temp_c_e)
                    print("New: ", current_epsilon)
                    print("t: ", t)
                    print("Start epsilon: ", start_epsilon)
                    print("Final Epsilon: ", current_final_epsilon)
                    print("Pretrain Steps", current_pretrain_steps)
                    print("final Exploration Step", current_final_exploration_step)
                """
                temp_c_e = current_epsilon
                #DEBUG: print("current epsilon: ", current_epsilon)
                action = choose_egreedy_action(session=sess, current_state=state, network=Q_network, epsilon=current_epsilon)
                #DEBUG: print("action: ", action)
                state_next, reward, terminal, info = game.step(action)
                reward_in_episode += reward
                #DEBUG: print(episode, "-", t, state_next, reward, terminal, info)
                replay.add(state, action, reward, state_next, terminal)
                if terminal:
                    #DEBUG: print("End of Game")
                    current_epsilon_plot.append(current_epsilon)
                    current_episode_length_plot.append(episode_steps)
                    current_episode_reward_plot.append(reward_in_episode)
                    state = game.reset()
                    state = np.reshape(state, [1, game.observation_space.shape[0]])
                    episode_steps = 0
                    reward_in_episode = 0
                else:
                    if t > current_pretrain_steps:
                        # Train and update the Target-Network variables
                        #DEBUG: print("Training")
                        train(sess, Q_network, Target_network, replay, batch_size)
                        if t % switch_networks == 0:
                            #DEBUG print("update Target network")
                            sess.run(update_target_network)
                    state = state_next
                #saver.save(sess, "./models/{0}.ckpt".format(current_name))
            epsilon_plot.append(current_epsilon_plot)
            episode_length_plot.append(current_episode_length_plot)
            episode_reward_plot.append(current_episode_reward_plot)
            model_name.append(current_name)
            plot_model(current_epsilon_plot, current_episode_length_plot, current_episode_reward_plot, current_name)
            saver.save(sess, 'models/{0}.ckpt'.format(current_name))
import random as r
from collections import deque
class MeanBuffer():
    def __init__(self, buffer_size):
        self.buffer = deque([], maxlen=buffer_size)

    def add(self, reward):
        self.buffer.append(reward)

    def length(self):
        return len(self.buffer)

    def sample(self):
        return self.buffer
# You code comes here
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import itertools
for index, model in enumerate(model_name):
    print(model)
    sns.set_style("darkgrid")
    pallete = sns.color_palette()
    epsilon_percentage = []
    for i, epsi in enumerate(epsilon_plot[index]):
        # Multiply with 100 - to show percentage
        epsilon_percentage.append(epsi * 100)
    mean_episode_length_plot = [np.mean(episode_length_plot[index]) for i in range(len(episode_length_plot[index]))]
    mean_episode_reward_plot = [np.mean(episode_reward_plot[index]) for i in range(len(episode_length_plot[index]))]
    mean_reward_ten_buffer = MeanBuffer(10)
    mean_reward_hundred_buffer = MeanBuffer(100)
    mean_reward_ten_list = []
    mean_reward_hundred_list = []
    for reward_index, reward in enumerate(episode_reward_plot[index]):
        mean_reward_ten_buffer.add(reward)
        mean_reward_hundred_buffer.add(reward)
        #DEBUG print(sum(mean_reward_ten_buffer.sample()))
        mean_reward_ten_list.append(sum(mean_reward_ten_buffer.sample()) / mean_reward_ten_buffer.length())
        mean_reward_hundred_list.append(sum(mean_reward_hundred_buffer.sample()) / mean_reward_hundred_buffer.length())
    fig = plt.figure(figsize=(25, 8))
    # Plot return per episode
    ax1 = fig.add_subplot(1, 2, 1)
    ax1.plot(episode_reward_plot[index], color=pallete[0], linewidth=0.5)
    ax1.plot(mean_episode_reward_plot, color=pallete[1])
    ax1.plot(mean_reward_ten_list, color=pallete[2], linewidth=3)
    ax1.plot(mean_reward_hundred_list, color=pallete[3], linewidth=3)
    ax1.plot(epsilon_percentage, color=pallete[4])
    ax1.legend(['Return', 'Mean', 'Mean-10', 'Mean-100', 'Epsilon'])
    ax1.set_title("Return per Episode", fontsize=14)
    plt.ylabel("Return")
    plt.xlabel("Episode")
    plt.savefig('./img/{}.png'.format(model))
    # Show the plot
    plt.tight_layout()
    plt.show()
In general, the evaluation of deep RL is discussed controversially among researchers since it remains unclear how to benchmark and compare such algorithms properly. Is the return or average return a good performance measure? How big is the impact of hyperparameters vs. general algorithm vs. implementation etc.? See the paper Deep Reinforcement Learning that Matters from Henderson et al. 2017 for a nice overview of these problems.
As part of this notebook however, we will evaluate our algorithm as done by the authors of DQN. The testing is very simple. Let the trained agent play the game $30$ times with an e-greedy policy with a fixed $\epsilon = 0.05$ and report the average high score (return).
Use a fixed evaluation_epsilon = 0.05.

game = gym.make('CartPole-v0')
# Your code comes here
eval_episodes = [50, 100, 250]
fixed_epsilon = 0.05
best_overall_mean_model = ""
tmp_mean_current_reward = 0
for current_name in model_name:
    tf.reset_default_graph()
    saver = tf.train.import_meta_graph("./models/{0}.ckpt.meta".format(current_name))
    temp_split_string = current_name.split("gamma_", 1)[1]
    #DEBUG print(temp_split_string)
    temp_split_string = temp_split_string.split("]", 1)[0]
    #DEBUG print(temp_split_string)
    current_gamma = float(temp_split_string)
    with tf.Session() as sess:
        Q_network = DQNetwork("Q", game.observation_space.shape[0], hidden_layers, game.action_space.n, current_gamma, learn_rate)
        sess.run(tf.global_variables_initializer())
        saver.restore(sess, tf.train.latest_checkpoint('./models/'.format(current_name)))
        #DEBUG print(tf.all_variables())
        #print(saver)
        for episode_length in eval_episodes:
            mean_reward_ten_list = []
            mean_reward_hundred_list = []
            current_rewards = []
            mean_reward_ten_buffer = MeanBuffer(10)
            mean_reward_hundred_buffer = MeanBuffer(25)
            for epi in range(episode_length):
                state = game.reset()
                state = np.reshape(state, [1, game.observation_space.shape[0]])
                x = 0
                while x < 200:
                    action = choose_egreedy_action(sess, state, Q_network, fixed_epsilon)
                    state_next, reward, terminal, info = game.step(action)
                    if terminal:
                        #print("End of Game!!!!")
                        #print(current_name, "\n", epi, ": ", x)
                        mean_reward_ten_buffer.add(x)
                        mean_reward_hundred_buffer.add(x)
                        mean_reward_ten_list.append(sum(mean_reward_ten_buffer.sample()) / mean_reward_ten_buffer.length())
                        mean_reward_hundred_list.append(sum(mean_reward_hundred_buffer.sample()) / mean_reward_hundred_buffer.length())
                        current_rewards.append(x)
                        break
                    else:
                        #game.render()
                        x += 1
                        state = state_next
            ### Plotting a graph
            fig = plt.figure(figsize=(25, 12))
            mean_current_rewards = [np.mean(current_rewards) for i in range(len(current_rewards))]
            #DEBUG print(sum(mean_reward_ten_buffer.sample()))
            # Plot return per episode
            ax1 = fig.add_subplot(1, 2, 1)
            ax1.plot(current_rewards, color=pallete[0], linewidth=0.5)
            ax1.plot(mean_reward_ten_list, color=pallete[1], linewidth=3)
            ax1.plot(mean_reward_hundred_list, color=pallete[2], linewidth=3)
            ax1.plot(mean_current_rewards, color=pallete[3])
            ax1.legend(['Reward', 'Mean-10', 'Mean-100', 'Mean'])
            ax1.set_title("{0}-{1}".format(current_name, episode_length), fontsize=10)
            plt.ylabel("Return")
            plt.xlabel("Episode")
            plt.savefig('./output/{0}_{1}-Episodes.png'.format(current_name, episode_length))
            # Show the plot
            plt.tight_layout()
            plt.show()
            if tmp_mean_current_reward < mean_current_rewards[0]:
                tmp_mean_current_reward = mean_current_rewards[0]
                best_overall_mean_model = current_name
print("Best Model: ", best_overall_mean_model)
print("Best Reward: ", tmp_mean_current_reward)
For this experiment we chose to train 64 different Models.
These different models were the result of using 6 hyperparameters with 2 different values each
$n_{models} = 2^6 = 64$
The selected hyperparameters with two different values each were:
- final epsilon [ 0.02 | 0.1 ]
- pre-training steps [ 700 | 5000 ]
- final exploration step [ 2x multiplier | 4x multiplier ]
- buffer size [ 500 | 10000 ]
- T (number of total training steps) [ 10000 | 35000 ]
- gamma [ 0.9 | 0.99 ]

For the hyperparameter final exploration step we chose to use a multiplier: we multiplied the current pre-training steps either by 2 or by 4. That way the epsilon decay was always relative to the number of pre-training steps.
final-epsiolon_0.02_pretrain-steps_700_final-exploration-step_2800_buffersize_10000_T_10000_gamma_0.99]

In this model we have a short pre-training phase of 700 steps. The epsilon decay is the longer variant, obtained by multiplying the pre-training steps by 4.
$\text{final exploration step} = 4 \cdot \text{pretrain steps} = 4 \cdot 700 = 2800$
The buffer size has been chosen to be large with 10000 stored samples. At the same time, the total number of training steps has been chosen low, also at 10000. That means every step of the run actually fits inside the replay buffer. Gamma has been set to 0.99.
During the first 130 episodes, the model does not achieve great results. But then, when epsilon has reached a low value of around 0.2, the model suddenly improves, and at the end of the 10000 steps it achieves very high returns. We can also see that the Mean-10 and Mean-100 scores rise constantly. Unfortunately, all of the good results happen in the last quarter of the training time.
Let us first take a look at the evaluation of this model. Afterwards, we can take a look at the model with more training steps and compare the two.
50 Episodes

100 Episodes

250 Episodes

In all three evaluations we can clearly see bad results, with a low mean reward of only around 9. This does seem odd. Let's investigate further.
Next we take a look at the model with more training steps:
final-epsiolon_0.02_pretrain-steps_700_final-exploration-step_2800_buffersize_10000_T_35000_gamma_0.99]

Here we can see that this model does have high peaks, but also lower values after having already reached a high peak. Another difference is that it takes this model longer to start producing better results: we have already reached the final epsilon value, but it needs some more episodes to achieve better results. Let's take a look at the evaluation.
50 Episodes

100 Episodes

250 Episodes

We can see much better results here than in the T=10000 variant; now we reach a mean reward of 68. But we are still not close to the maximum. Let's see what happens if we reduce the buffer size:
final-epsiolon_0.02_pretrain-steps_700_final-exploration-step_2800_buffersize_500_T_35000_gamma_0.99]
50 Episodes

100 Episodes

250 Episodes

As expected, a smaller replay buffer leads to worse results. Let's see what happens if we increase the pre-training steps (we return the buffer size to 10000):
final-epsiolon_0.02_pretrain-steps_5000_final-exploration-step_20000_buffersize_10000_T_35000_gamma_0.99]
50 Episodes

100 Episodes

250 Episodes

Similar to the first model, we have a good training graph, but it ends too soon. To see better results here, we would probably need to increase the training steps. But we can also try to decrease the final exploration steps:
final-epsiolon_0.02_pretrain-steps_5000_final-exploration-step_10000_buffersize_10000_T_35000_gamma_0.99]
50 Episodes

100 Episodes

250 Episodes

Apparently, decreasing the final exploration step led to worse results. So let's return to our best model yet (lower pre-training steps with the 4x multiplier and a large buffer size) and increase the final epsilon value:
final-epsiolon_0.1_pretrain-steps_700_final-exploration-step_2800_buffersize_10000_T_35000_gamma_0.99]

50 Episodes

100 Episodes

250 Episodes

Well, we have found a better working model. So why is it better to use more random actions and not rely so much on the trained network? Perhaps because we haven't trained it long enough to get good results. Here we should definitely try out a longer exploration time. Let's take a look yet again, with longer pre-training:
final-epsiolon_0.1_pretrain-steps_5000_final-exploration-step_20000_buffersize_10000_T_35000_gamma_0.99]

50 Episodes

100 Episodes

250 Episodes

Well, not so good. Even though we let the agent explore longer, we got worse results. Of course we should include in our analysis the fact that this model had far fewer exploitation steps, so a longer training run would be wise to test here. Another thing to test would be to keep the pre-training steps quite low but increase the final exploration step. That way epsilon would decrease more slowly and the agent would have more time to adapt from random actions to predicted actions. Well, let's return to our best model yet and swap out our last hyperparameter, gamma:
final-epsiolon_0.1_pretrain-steps_700_final-exploration-step_2800_buffersize_10000_T_35000_gamma_0.9]

50 Episodes

100 Episodes

250 Episodes

Well, what can we learn from all this? So far not very much. It would need more testing, especially with more training steps. Another hyperparameter we have completely left out of the loop for this experiment is the learning rate of the network. Our best model out of all 64 reached a good mean reward of 161. But at the same time, if we take a look at the individual rewards per episode, we can clearly see that the rewards are jumping around, which indicates that this model doesn't really know what it's doing. This may come from a low buffer size, or from short pre-training and a short final exploration step.
For further experiments we definitely advise increasing the training steps, keeping a large buffer size, perhaps increasing the final exploration step, playing around with other hyperparameters such as the learning rate, and of course using a different architecture for the neural network.
We suppose it is possible, but at the same time hard to achieve. We would need to train a perfect agent. The final epsilon should be at around 0, because we cannot rely on a random action to be good; it might ruin the perfect Mean-100 score. We suppose it would need a better neural network architecture to predict better actions and a much longer training time. Another thing to try would be different epsilon schedules. Possible examples could be:
$\alpha \cdot \sin(x)$
or
$\max(0, \alpha \cdot \sin(x))$
or
$\alpha \cdot |\sin(x)|$
with $\alpha$ being a stretching amplitude hyperparameter. We could also stretch the sine curve in the frequency direction. $\alpha$ could also be tied to the total number of time steps so that it gradually decreases over time. That way we would not simply stop exploring after reaching the final epsilon value for the first time, but instead start exploring again.
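As a rough illustration, such an oscillating schedule could be sketched like this (the function name, the frequency constant and the linear decay of $\alpha$ are purely illustrative assumptions, not something we implemented):

```python
import numpy as np

def oscillating_epsilon(t, total_steps, alpha=1.0, freq=0.001, final_epsilon=0.02):
    # Rectified sine wave whose amplitude alpha shrinks linearly over the run,
    # so the agent periodically returns to exploration with smaller and smaller bursts.
    # (A phase shift would make the schedule start at its maximum instead of 0.)
    amplitude = alpha * (1.0 - t / total_steps)
    return max(final_epsilon, amplitude * abs(np.sin(freq * t)))

values = [oscillating_epsilon(t, 40000) for t in range(40000)]  # plot this to inspect the shape
```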
- As always, experiment with Hyperparameters/Network sizes etc. and reason about their effects/importance!
- Implement and experiment with new/different exploration schemes.
Because time ran out, we weren't able to implement all of our ideas, or try the Atari Implementation with Deep Reinforcement Learning.
So far you have implemented a basic DQN agent. For simplicity we have left out some important details which are crucial in order to play video games. If you are eager to do this anyway here are the missing parts.
First of all, switch the simple MLP with the following architecture from the paper. Note that there are no pooling layers in this CNN!

We have not talked about observability so far. Formally, Atari video games are Partially Observable Markov Decision Processes, or POMDPs. This means that the game screen is not a sufficient observation to fully describe the underlying state and that the Markov assumption does not hold. A simple example makes this clear: think about the game Pong. Given only one frame, the agent has no way of telling whether the ball is currently moving from left to right or from right to left. For that reason the authors used the last 4 frames of the game as the observation. This turns the POMDP into an MDP again. Furthermore, they applied some more preprocessing steps to the game screens, such as converting them to grayscale, rescaling, and taking the max over two subsequent frames. Please see the methods section of the DQN paper for more details on that.
Hint: In order to implement this you have to keep some sort of frame buffer of the last 4 frames etc.!
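A minimal sketch of such a frame buffer could look like the following (the class name, the 84x84 frame size and the stacking axis are assumptions based on the paper's preprocessing, not code from this notebook):

```python
from collections import deque
import numpy as np

class FrameStack():
    """Keeps the last k preprocessed frames and returns them stacked as one observation."""
    def __init__(self, k=4, frame_shape=(84, 84)):
        self.k = k
        self.frames = deque([np.zeros(frame_shape)] * k, maxlen=k)

    def reset(self, first_frame):
        # At the start of an episode, fill the buffer with copies of the first frame
        self.frames = deque([first_frame] * self.k, maxlen=self.k)
        return np.stack(self.frames, axis=-1)

    def add(self, frame):
        # Append the newest frame; the oldest one is dropped automatically
        self.frames.append(frame)
        return np.stack(self.frames, axis=-1)   # shape (84, 84, 4) as input for the CNN
```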
In order to play faster, the authors trained the network only every $K=4$ time steps. In between, the last taken action was repeated. This allows the agent to play more games, i.e. gather more experience in less time, since stepping the emulator forward is computationally cheaper than training the network. Again, see the paper for details.
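A tiny, self-contained sketch of this pattern (all names and numbers here are illustrative):

```python
import random

K = 4          # choose a new action only every K-th step; repeat it in between
action = 0

def pick_action():
    # Stands in for the e-greedy forward pass through the Q-Network
    return random.randint(0, 1)

for t in range(12):
    if t % K == 0:
        action = pick_action()
    print(t, action)   # the chosen action is repeated for K consecutive steps
```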
During training, the authors clipped the reward to the range $R \in \{-1,0,1\}$. Remember to remove this constraint during testing again to get the real high score.
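A minimal sketch of such reward clipping (np.sign is just one way to do it):

```python
import numpy as np

# Maps any positive reward to 1, any negative reward to -1, and 0 stays 0 (training only)
raw_reward = 250.0
clipped_reward = float(np.sign(raw_reward))   # -> 1.0
```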
Another way to improve training stability was to clip the gradients (or better, the squared L2 loss) to the range $[-1,1]$. In other words, only apply the L2 loss if the error is inside this range and use a linear loss outside. This corresponds to a Huber loss. Here you can find a TensorFlow implementation of that from the OpenAI baseline agents. Please read the Wikipedia article to see what the function is doing.
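A minimal sketch of such a Huber loss in TensorFlow 1.x could look like this (one common formulation; TensorFlow also ships a built-in tf.losses.huber_loss):

```python
import tensorflow as tf

def huber_loss(td_error, delta=1.0):
    # Squared loss inside [-delta, delta], linear loss outside
    abs_error = tf.abs(td_error)
    quadratic = 0.5 * tf.square(td_error)
    linear = delta * (abs_error - 0.5 * delta)
    return tf.where(abs_error < delta, quadratic, linear)
```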
The original DQN implementation used a slightly modified version of RMSProp as its optimizer. You don't have to implement this; it is perfectly fine to stay with Adam, for instance. However, be aware that the learning rate is a really crucial parameter in this context. If anything, this is definitely the first (and probably the most important) hyperparameter for which you want to test different settings!
See the paper for a list of good default parameters. Due to the lengthy training times you may want to reduce the total amount of time steps the agent will be trained. You also may adjust the exploration accordingly. However, be aware that exploration time is very important. You may want to benchmark a very short training first and then do some rough calculations of how long it will take to train the agent for some $x$ time steps etc. Then plan some experiments.
[1] Tensorflow - tf.layers
[2] Tensorflow - tf.constant
[3] Simple Reinforcement Learning with Tensorflow Part 7: Action-Selection Strategies for Exploration
[4] Tensorflow - tf.argmax
[5] Save and Restore - Tensorflow
[6] No Rendering on Jarvis
[7] Split String - Stack Overflow
[8] Zip Folder in Jupyter Notebook - Stack Overflow
!tar chvfz notebook.tar.gz *